-
Notifications
You must be signed in to change notification settings - Fork 159
Fix UTF-8 decoding of lazy bytestrings #333
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
I don't understand the doctest errors on GHC 8+, and on GHC 7+ the errors are that old versions of bytestring did not have |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doctests are failing in master as well (I guess it has something to do with a new version of doctest package?.. Dunno)
Could you please check a coverage report to ensure that all lines are well-tested?
b42b59c to
de7c071
Compare
|
That's probably fine, I guess. Looks good to me except GHC < 7.6 builds. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
|
(Found a silly space.) |
|
@Lysxia could you please resolve a conflict? |
At the beginning of a new chunk we may be trying to complete a UTF-8 sequence started in the previous chunk (contained in the `undecode0` buffer). If it turns out to be invalid, we must apply the `onErr` handler to every character in that buffer. When we reach the end of the chunk, we must also be more careful about when to keep the previous buffer: a UTF-8 sequence (up to 4 bytes) can span more than two chunks, when those chunks are very short (of length 0, 1, or 2).

Fixes #330.
I found another issue that's not yet fixed: both strict and lazy
decodeUtf8Withare actually memory unsafe if you use a badonErrargument. We allocate a destination buffer with 2x the number of bytes from the original bytestrings, but by using anonErrfunction which replaces any invalid byte with aCharwhich is a surrogate pair in UTF-16, it possible to blow up the size taken by aTextto 4x. In practice,onErris almost alwayslenientDecodethough, so perhaps a better solution than allocating more memory is to either hidedecodeUtf8Withor clamp the range ofonErr.